{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "source": [
    "##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n",
    "```\n",
    "cd tokenization2arrow\n",
    "make venv \n",
    "source venv/bin/activate \n",
    "pip install jupyterlab\n",
    "pip install -U ipywidgets\n",
    "./venv/bin/jupyter lab\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### **** Configure the transform parameters. The set of dictionary keys holding Tokenization2Arrow configuration for values are as follows: \n",
    "| Name | Description|\n",
    "| -----|------------|\n",
    "|tkn_tokenizer | Tokenizer used for tokenization. It also can be a path to a pre-trained tokenizer. By defaut, `hf-internal-testing/llama-tokenizer` from HuggingFace is used |\n",
    "|tkn_tokenizer_args |Arguments for tokenizer. For example, `cache_dir=/tmp/hf,use_auth_token=Your_HF_authentication_token` could be arguments for tokenizer `bigcode/starcoder` from HuggingFace|\n",
    "|tkn_doc_id_column|Column contains document id which values should be unique across dataset|\n",
    "|tkn_doc_content_column|Column contains document content|\n",
    "|tkn_text_lang|Specify language used in the text content for better text splitting if needed|\n",
    "|tkn_chunk_size|Specify >0 value to tokenize each row/doc in chunks of characters (rounded in words)|\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove output folder\n",
    "!rm -rf output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set huggingface token to download llama tokenizer\n",
    "import os\n",
    "os.environ['HF_TOKEN'] = 'hf_XXXX'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dpk_tokenization2arrow.runtime import Tokenization2Arrow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Tokenization2Arrow(\n",
    "        input_folder= \"test-data/ds02/input\",\n",
    "        output_folder= \"output\",\n",
    "        tkn_tokenizer=  \"hf-internal-testing/llama-tokenizer\",\n",
    "        tkn_chunk_size= 20_000).transform()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Explore output folder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[01;34moutput\u001b[0m\n",
      "├── \u001b[00mdf_17m.arrow\u001b[0m\n",
      "├── \u001b[01;34mmeta\u001b[0m\n",
      "│   ├── \u001b[00mdf_17m.docs\u001b[0m\n",
      "│   └── \u001b[00mdf_17m.docs.ids\u001b[0m\n",
      "└── \u001b[00mmetadata.json\u001b[0m\n",
      "\n",
      "2 directories, 4 files\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
      "To disable this warning, you can either:\n",
      "\t- Avoid using `tokenizers` before the fork if possible\n",
      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
     ]
    }
   ],
   "source": [
    "!tree output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check metadata.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat output/metadata.json"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
